ABSTRACT

Evidence is presented on how the psychometric properties of verbal
and quantitative academic aptitude tests are affected when item
options are weighted using conceptually simple procedures. The
findings are discussed in connection with the scoring methods
used on the Graduate Record Examinations. (DG)
The study examined some of the issues related to empirical option
weighting using a large and representative data base. We hoped to
obtain fairly general answers to the following questions:

(1) What happens to the internal consistency and parallel forms reli- 
ability of a test keyed to increase parallel forms reliability or 
internal consistency? 

(2) Does either type of keying result in an increase in validity over 
conventional scoring methods either for individual sub-tests or 
when verbal and quantitative tests are combined to obtain a 
multiple correlation? 

(3) If the answer to the last question is yes, which of the two methods 
of keying seems to offer the most promise?

In part, the study attempted to replicate the findings of Hendriksen 
(1971) and Davis and Fifer (1959) with a high level aptitude test, the 
Graduate Record Examinations (GRE). Both these studies produced evidence 
indicating that by empirically weighting options, reliability can be 
increased by practically significant amounts.

It was hoped that the study would provide further evidence on how 
the psychometric properties of verbal and quantitative academic aptitude 
tests are affected when options are keyed using conceptually simple
procedures. 

Method 

Test Forms 

The first step was to devise two parallel forms each of the verbal
(denoted V1 and V2) and quantitative (Q1 and Q2) sections of the GRE
by assigning one-half of the items on each section to each of the two
special parallel forms. Forms V1 and V2 consisted of 50 items each,
while forms Q1 and Q2 consisted of 27 items each. It should be noted
that the forms within each set were not administered under separate
time limits, since the forms were constructed from operational tests.
While the more desirable procedure would have been to administer the
two parallel forms under separately timed conditions, this was not
possible. The GRE, however, is considered by most definitions to be a
power test, so any effects due to correlated speed components should
have been negligible.

Sample 

Next, a spaced sample of 5,000 answer sheets from the December 1970
administration of the GRE was taken for study purposes. A second sample
(sample C), consisting of the answer sheets of individuals from the same
administration, was taken for validation purposes. The first sample was
divided into two randomized block groups of 2,500 (samples A and B) by
blocking on the total GRE score (V + Q). This increased the probability
that total score means and standard deviations for these two groups were
approximately equal.

Keying Procedures 

Two different types of keying were carried out. The first, designed
to increase internal consistency, was similar to that described by
Hendriksen (1971). The procedure first scored each sub-form using the
conventional scoring formula (i.e., rights - 1/4 wrongs) and then, for
each item, keyed each option, including the omit category, by assigning
the mean standard score on the remaining items for all persons choosing
that option. We departed in one respect from Hendriksen's method in that
we did not perform any iterations. The second procedure was similar to
the one employed by Davis and Fifer (1959) and assigned to each option
of an item the mean standard score on the corresponding parallel
sub-form of all individuals choosing that option.
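
The first (internal) keying procedure can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the program used in the study: the function names, the data layout (one list of option choices per examinee, with None standing for an omit), and the use of the population standard deviation are all assumptions made for the example.

```python
from statistics import mean, pstdev


def item_formula_score(choice, key, n_wrong_options=4):
    """Conventional formula score for one item: right = 1, wrong = -1/4, omit = 0."""
    if choice is None:  # None marks an omitted item (assumed encoding)
        return 0.0
    return 1.0 if choice == key else -1.0 / n_wrong_options


def internal_option_weights(responses, keys):
    """For each item, weight each option (including omit) by the mean
    standard score on the remaining items of the examinees choosing it."""
    n_items = len(keys)
    item_scores = [
        [item_formula_score(person[j], keys[j]) for j in range(n_items)]
        for person in responses
    ]
    totals = [sum(row) for row in item_scores]
    weights = []
    for j in range(n_items):
        # Formula score on the remaining items, i.e. with item j removed.
        rest = [total - row[j] for total, row in zip(totals, item_scores)]
        mu, sd = mean(rest), pstdev(rest)
        if sd == 0:
            raise ValueError("rest scores have no variance for item %d" % j)
        z_scores = [(x - mu) / sd for x in rest]
        by_option = {}
        for person, z in zip(responses, z_scores):
            by_option.setdefault(person[j], []).append(z)
        weights.append({option: mean(zs) for option, zs in by_option.items()})
    return weights


# Hypothetical toy data: four examinees, three items, None = omit.
responses = [
    ["A", "B", "C"],
    ["A", "B", "D"],
    ["B", None, "C"],
    ["C", "D", None],
]
keys = ["A", "B", "C"]
weights = internal_option_weights(responses, keys)
```

The second (parallel forms) procedure differs only in the criterion: the standard score on the corresponding parallel sub-form would replace the rest score computed inside the loop.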

Analyses 

The next step was to score each sub-form in Sample A using the
weights derived in Sample B and vice versa. Thus, for each sub-form three
scores were generated: the conventional formula score, the score using
weights derived on a parallel form, and the score using weights derived
by keying on the m-1 remaining items. For each of the three scoring
methods, alpha coefficients were computed for each sub-form, and
intercorrelations among sub-forms were also computed. Thus, cross-validated
alpha coefficients and parallel forms reliabilities were obtained for both
Samples A and B.
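
For readers who want the internal consistency index spelled out, coefficient alpha can be computed as in the sketch below. The function name and data layout (one row of item scores per examinee) are assumptions made for this illustration; the item scores may be formula scores or empirical option weights.

```python
from statistics import pvariance


def coefficient_alpha(item_scores):
    """Coefficient alpha for a persons-by-items table of item scores."""
    n_items = len(item_scores[0])
    if n_items < 2:
        raise ValueError("alpha needs at least two items")
    # Variance of each item across persons.
    item_variances = [
        pvariance([row[j] for row in item_scores]) for j in range(n_items)
    ]
    # Variance of the total score across persons.
    total_variance = pvariance([sum(row) for row in item_scores])
    if total_variance == 0:
        raise ValueError("total scores have no variance")
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical data: three items that order three examinees identically.
consistent = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
# Hypothetical data: two items that are unrelated to each other.
unrelated = [[0, 0], [0, 1], [1, 0], [1, 1]]
alpha_consistent = coefficient_alpha(consistent)
alpha_unrelated = coefficient_alpha(unrelated)
```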

Table 1 shows the cross-validated internal consistency coefficients
for each type of weighting system. The K values shown reflect the
proportional increase in test length estimated by the Spearman-Brown
formula. The gains are substantial, given the crucial assumption that the
same latent trait, or set of latent traits, is being measured by the test.
Table 2 shows that the parallel forms reliability estimates follow a
highly similar pattern, with estimated effective changes in test length
ranging from slightly more than one and one-half times the original for
one quantitative sub-form to more than twice the original length for the
verbal forms.
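
The K values can be reproduced from the reliabilities by rearranging the Spearman-Brown formula, as in the short sketch below. The function names are assumptions made for this illustration; the two input values are the formula and parallel-forms-keyed alphas reported for form V1, Sample A, in Table 1.

```python
def spearman_brown(reliability, k):
    """Reliability of a test lengthened by a factor k (Spearman-Brown)."""
    return k * reliability / (1 + (k - 1) * reliability)


def lengthening_factor(rel_formula, rel_weighted):
    """Factor K by which a test with reliability rel_formula would have to be
    lengthened to reach rel_weighted (Spearman-Brown solved for K)."""
    for value in (rel_formula, rel_weighted):
        if not 0 < value < 1:
            raise ValueError("reliabilities must lie strictly between 0 and 1")
    return (rel_weighted * (1 - rel_formula)) / (rel_formula * (1 - rel_weighted))


# Form V1, Sample A: formula alpha .8695, parallel-forms-keyed alpha .9285.
k_v1 = lengthening_factor(0.8695, 0.9285)
print(round(k_v1, 2))  # 1.95
```

Lengthening the formula-scored form by this factor returns the weighted reliability, which is the sense in which K expresses the gain as an equivalent increase in test length.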
These data gave a clear answer to our first question: what happens
to internal consistency and parallel forms reliability when options are
empirically keyed? Both measures are increased rather substantially by
empirical weighting. It is also worth noting that the two types of keying
we carried out were for all practical purposes identical in their effects;
in fact, cross-validated scores yielded by the two methods were correlated
close to 1.0 (all correlations were .999 or greater).

The real test of this procedure came in the next set of analyses
we performed. For this purpose the answer sheets of over 4,000 college
students who had taken the GRE at the same administration from which
we selected our keying samples were scored with formula score weights and
with empirically derived weights. None of this group was included in
the keying samples; its members were selected on the basis of undergraduate
institution attended. A total of 2*0 institutions provided cumulative
undergraduate GPA data for these individuals. Within-school sample sizes
ranged from 16 to 399, with a mean within-school sample size of 130. Taking
pairs of verbal and quantitative sub-forms, we computed both single order
and multiple correlations between conventionally scored tests and GPA and
between empirically weighted scores and GPA. The results were highly
consistent. Both single order and multiple correlations were slightly
but consistently higher for the formula scores. The weighted scores
produced, on the average, a multiple R .05 less than the multiple R
obtained with formula scores. In only one case was there a substantial
difference in favor of the weighted scores (.10). The conclusion that
empirical option weighting did not lead to any increase in validity was
clear enough, but the reasons for this were not. One would have expected
the more reliable scores to predict the GPA criterion slightly more
accurately.
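
As a reminder of how the multiple correlation depends on its ingredients, the two-predictor case can be written directly in terms of the three zero-order correlations. The sketch below is illustrative only: the function name is an assumption and the numerical inputs are hypothetical, not values from this study.

```python
from math import sqrt


def multiple_r(r_v, r_q, r_vq):
    """Multiple correlation of a criterion with two predictors, given the
    validity of each predictor (r_v, r_q) and their intercorrelation (r_vq)."""
    if abs(r_vq) >= 1:
        raise ValueError("predictor intercorrelation must be between -1 and 1")
    r_squared = (r_v ** 2 + r_q ** 2 - 2 * r_v * r_q * r_vq) / (1 - r_vq ** 2)
    return sqrt(r_squared)


# Hypothetical validities of .30 for each predictor.
r_low_overlap = multiple_r(0.30, 0.30, 0.45)
r_high_overlap = multiple_r(0.30, 0.30, 0.53)
```

In this hypothetical case, raising the intercorrelation between the two predictors while holding their validities fixed lowers the multiple R, because the second predictor adds less unique information.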
Several explanations were considered. One possibility was that the
weighted score reliabilities, which held up so well in our carefully
constructed A and B samples, broke down in the validity sample. This was
not the case, however. The reliabilities for the weighted scores were
consistently and substantially higher in the validation sample. A
second possibility was that the keying procedure resulted in tests which
were "factor pure" and because of this were less useful for predicting
the GPA criterion, which is generally assumed to be factorially
heterogeneous. The increased alpha coefficients certainly supported this
notion. If this second explanation were true, however, one should observe
a lowering of intercorrelations between the verbal and quantitative
sub-tests. But this was not the case. The correlation between V and Q in
fact was increased substantially when empirical weights were applied. This
increase was also quite a bit more than one would expect from the
increases in reliability (see Table 3).

This led us to consider a third possibility: that the empirical
weighting was ordering people not only on verbal and quantitative ability
but on some other factor which was reliable but not valid. The pattern
of intercorrelations between weighted and unweighted scores supports this
last explanation. Considering the verbal sub-forms only, we see in Table 4
that although the correlation between weighted parallel forms goes up, the
correlation between the weighted form and the unweighted parallel form
goes down. The r between PF1 and F2, for example, is lower than that
between F1 and F2. If, as we had assumed, we were merely increasing
the reliability with which we measured true scores, the correlation
between PF1 and F2 should have increased, and this increase should have
been directly related to the increase in reliability.
The pattern was similar for the quantitative sub-forms.

Our analyses are continuing, but at this point we can at least suggest
what may be happening. The GRE, like the SAT, is a formula scored test,
which means that an examinee's score is equal to the number of correct
answers minus 1/4 times the number wrong. The effective weight for an omit
under this scoring system is the mean expected score assuming a random
response to the choices. In the usual case this is zero. Whether these
assumptions are valid or not is a question which cannot be dealt with here.
The important point is that the propensity to omit responses (or,
conversely, to take risks) is a highly reliable behavior (e.g., Slakter,
1967).

The procedure we used to key assigned a weight to the omit category 
which did not, in most cases, meet or even come close to meeting the 
formula score condition that the omit category equal the mean expected 
score for the item given a random response to the alternatives. 

If we consider Table 5, we see that the actual weight assigned
(in the O column) differs considerably from what would be the mean expected
weight (the O' column). For some of the verbal items shown, examinees
were actually given a bonus for not responding. In other cases they paid a
penalty. For the quantitative tests they always paid a penalty, which was
in some cases quite severe.
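
The formula score condition can be made concrete with a short worked calculation. It assumes five-choice items (one right answer and four wrong options, as in the columns of Table 5) and a purely random choice among the five alternatives. The first line shows that the expected formula score for a random response is zero, which is the weight an omit receives under conventional scoring. The second line gives the corresponding mean expected weight under an empirical key, and the third evaluates it for verbal item 11 using the weights listed in Table 5.

```latex
\begin{align*}
E[\text{formula score} \mid \text{random response}]
  &= \tfrac{1}{5}(1) + \tfrac{4}{5}\left(-\tfrac{1}{4}\right) = 0 \\
\bar{w}
  &= \tfrac{1}{5}\left(w_R + w_{W_1} + w_{W_2} + w_{W_3} + w_{W_4}\right) \\
\bar{w}_{\text{item 11}}
  &= \tfrac{1}{5}\left(.194 - .971 - .530 - .718 - .317\right) \approx -.468
\end{align*}
```

An omit weight above this mean acts as a bonus for not responding, and an omit weight below it acts as a penalty.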

What we are suggesting is that when a test is given with the usual
guessing instructions, the empirical keying procedures described capitalize
on the tendency to omit, and that while this tendency is reliable, it is
not valid. This would explain the decreases in validity in spite of
increases in reliability that we observed and would also explain the
increase in the correlation between V and Q.
A new keying procedure, which we hope will offer more promise, has
been worked out and will be applied shortly. This procedure assigns
weights to responses which are optimum in the least squares sense,
but subject to the constraint that the weight for omit equals the average
of the other weights.
Table 1

Cross-Validated Internal-Consistency Coefficients
for Three Different Sets of Weights

Sample A

            Formula        Parallel Forms Keyed     Internally Keyed
Form        α              α          K¹            α          K
V1          .8695          .9285      1.95          .9273      1.91
V2          .8671          .9259      1.92          .9269      1.94
Q1          .8458          .9105      1.85          .9143      1.95
Q2          .8715          .9140      1.57          .9113      1.51

Sample B

            Formula        Parallel Forms Keyed     Internally Keyed
Form        α              α          K¹            α          K
V1          [illegible]    .9297      1.92          .9292      1.88
V2          .8755          .9308      1.91          .9312      1.92
Q1          .8515          .9131      1.83          .9178      1.95
Q2          .8725          .9164      1.60          .9125      1.52

¹ K gives the estimated proportional increase in test length which
would be necessary to yield the increased α's shown. Rearranging
the Spearman-Brown prophecy formula,

$$K = \frac{\alpha_w (1 - \alpha_F)}{\alpha_F (1 - \alpha_w)}$$

where $\alpha_F$ is the α obtained with formula score weights and
$\alpha_w$ is the α obtained with cross-validated empirical weights.
Table 2

Cross-Validated Parallel Forms Reliabilities
for Three Different Sets of Weights

Sample A

            Formula        Parallel Forms Keyed     Internally Keyed
Test        R              R          K¹            R          K
V           .8780          .9445      2.36          .9427      2.30
Q           .8722          .9276      1.88          .9183      1.65

Sample B

            Formula        Parallel Forms Keyed     Internally Keyed
Test        R              R          K¹            R          K
V           .8909          .9479      2.23          .9497      2.31
Q           .8742          .9170      1.99          .9267      1.82

¹ K gives the estimated proportional increase in test length which
would be necessary to yield the increased R's shown. Rearranging
the Spearman-Brown prophecy formula,

$$K = \frac{R_w (1 - R_F)}{R_F (1 - R_w)}$$

where $R_F$ is the R obtained with formula score weights and $R_w$ is
the R obtained with cross-validated empirical weights.






Table 3

Intercorrelations Between V and Q
for Three Different Types of Scoring Systems

Sample A

            Formula        Parallel Forms Keyed¹    Internally Keyed
V1Q1        .4509          .5440 (.4823)            .5494 (.4794)
V2Q1        .4531          .5290 (.4847)            .5487 (.4818)
V1Q2        .4253          .5097 (.4949)            .4906 (.4522)
V2Q2        .4286          .4934 (.4984)            .4889 (.4557)

Sample B

            Formula        Parallel Forms Keyed¹    Internally Keyed
V1Q1        .4154          .5300 (.4416)            .5223 (.4388)
V2Q1        .4190          .5270 (.4443)            .5051 (.4415)
V1Q2        .4079          .4863 (.4436)            .5064 (.4309)
V2Q2        .4061          .4800 (.4317)            .4894 (.4291)

¹ The values in parentheses represent the expected correlation which should
have resulted from the increased reliability of the empirical key scores.
These values were obtained by multiplying the true formula score correlations
between V and Q by the geometric mean of the empirical key score
reliabilities. Parallel-forms reliabilities were used in all cases.
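
The calculation described in the footnote can be sketched as follows. The sketch reads "true formula score correlation" as the observed formula score correlation corrected for attenuation using the formula score parallel-forms reliabilities; that reading, and the function name, are assumptions made for this illustration. With the Sample A values for V1Q1 from Tables 2 and 3, it reproduces the parenthesized entry in the parallel forms keyed column.

```python
from math import sqrt


def expected_weighted_r(r_formula, rel_v_formula, rel_q_formula,
                        rel_v_weighted, rel_q_weighted):
    """Correlation expected between weighted V and Q scores if weighting
    had changed nothing except the reliability of the two scores."""
    # Correct the observed formula score correlation for attenuation.
    true_r = r_formula / sqrt(rel_v_formula * rel_q_formula)
    # Re-attenuate using the geometric mean of the weighted reliabilities.
    return true_r * sqrt(rel_v_weighted * rel_q_weighted)


# Sample A, V1Q1: formula r = .4509; formula reliabilities .8780 (V) and
# .8722 (Q); parallel-forms-keyed reliabilities .9445 (V) and .9276 (Q).
expected = expected_weighted_r(0.4509, 0.8780, 0.8722, 0.9445, 0.9276)
print(round(expected, 4))  # 0.4823
```

The observed weighted correlation for the same pair (.5440) is well above this expected value, which is the discrepancy the text attributes to a reliable but invalid factor.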






Table 4

Sample A Correlations Between Formula Scores
and Scores Using Weights Derived on Parallel Forms

            F1         F2         PF1
F2          .8780
PF1         .9161      .8509
PF2         .8518      .9200      .9434
Table 5

Empirical Option Weights for Selected Items

Verbal Form, Sample A

Item #    R        W1        W2        W3        W4        O         O'
 1        .144    -1.180    -1.128    - .211    -1.347    - .474    - .744
11        .194    - .971    - .530    - .718    - .317    - .45?    - .468
21        .186    - .656    -1.167    - .95?    -1.233    - .753    - .773
31        .273      .126    - .965    - .073    - .174    - .964    - .166
41        .199    - .915    - .398    - .631    -1.018    -1.396    - .553
51        .524    - .039      .131    - .166    - .318    - .581      .026

Quantitative Form, Sample A

Item #    R        W1        W2        W3        W4        O         O'
 1        .128    - .734    -1.089    - .631    - .881    -1.925    - .641
 6        .141    - .838      .187    - .501    - .924    -1.186    - .387
11        .158    [illegible] - .141  - .443    - .516    -1.266    - .292
16        .397    - .488    - .585    - .918    - .951    -1.117    - .509
21        .287    - .616    - .027    -1.178    - .493    - .740    - .405
26        .666      .150      .166    - .295      .010    - .477    - .139



